feat(ingest): minimal mode — parse→persist→ready, skip LLM enrichment + tables#30

Merged
hallelx2 merged 3 commits into main from feat/minimal-ingest-mode
May 28, 2026

hallelx2 (Owner) commented May 28, 2026

Why

Today, full ingest of a ~90-page 10-K does ~1,000–3,000 LLM calls (summarize every section + HyDE every leaf + multi-axis + a TOC build) AND a slow/hang-prone pdftable table-extraction pass — minutes of wall time before a document is ready.

But the engine's page-based retrieval strategy (/v1/answer/pageindex, the path that produces cited answers) needs none of that enrichment. It navigates a TOC tree (synthesising one from the section tree when documents.toc_tree is NULL) and reads raw section/page text at query time. So for that path we can collapse ingest to: parse → build section tree → persist → ready — roughly parse-speed (seconds).

This PR adds a minimal ingest mode that does exactly that.
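
The TOC fallback described above can be pictured with a small sketch. This is not the engine's code: `Section`, `TOCEntry`, `synthesiseTOC`, and `resolveTOC` are hypothetical names, and the only behaviour taken from the description is that a TOC is derived from the section tree when no stored TOC exists.

```go
package main

import "fmt"

// Section is a hypothetical stand-in for a persisted section node.
type Section struct {
	Title     string
	PageStart int
	PageEnd   int
	Children  []*Section
}

// TOCEntry is a hypothetical TOC node carrying a title and a page range.
type TOCEntry struct {
	Title     string
	PageStart int
	PageEnd   int
	Children  []TOCEntry
}

// synthesiseTOC mirrors the section tree into a TOC without any LLM call.
func synthesiseTOC(s *Section) TOCEntry {
	e := TOCEntry{Title: s.Title, PageStart: s.PageStart, PageEnd: s.PageEnd}
	for _, c := range s.Children {
		e.Children = append(e.Children, synthesiseTOC(c))
	}
	return e
}

// resolveTOC prefers a stored TOC and falls back to synthesis when it is nil
// (the analogue of documents.toc_tree being NULL).
func resolveTOC(stored *TOCEntry, root *Section) TOCEntry {
	if stored != nil {
		return *stored
	}
	return synthesiseTOC(root)
}

func main() {
	root := &Section{Title: "10-K", PageStart: 1, PageEnd: 90, Children: []*Section{
		{Title: "Item 1. Business", PageStart: 3, PageEnd: 12},
		{Title: "Item 1A. Risk Factors", PageStart: 13, PageEnd: 30},
	}}
	toc := resolveTOC(nil, root)
	for _, e := range toc.Children {
		fmt.Printf("%s (pp. %d-%d)\n", e.Title, e.PageStart, e.PageEnd)
	}
}
```

Because the fallback only needs titles and page ranges, a minimal-ingested document already carries everything this path reads.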

What's skipped in minimal mode

  • summarize stage
  • HyDE candidate-question stage
  • multi-axis structured summarizer
  • LLM TOC-builder stage
  • pdftable table extraction — the parser registry is rebuilt with table opts nil regardless of ingest.tables.enabled (the table-finding pass is the slow/hang-prone part of parse; the page strategy reads raw page text, which still contains the table's text, so dropping table sections loses nothing for it)

The document flips to StatusReady immediately after the section tree is persisted. mode: full (the default) is unchanged.

Design

  • pkg/config: IngestConfig.Mode (yaml key mode; values full | minimal, default full), env override VLE_INGEST_MODE; Validate rejects unknown values.
  • internal/config (deployed server wrapper): forwards firstEnv("VLS_INGEST_MODE", "VLE_INGEST_MODE") → c.Engine.Ingest.Mode, so the live vectorless-server flips with one env var, no secret edit.
  • pkg/ingest: Pipeline.Mode; Run dispatches to a new runMinimal when minimal. persistTree/parse/fail now take the persistence target through a narrow docPersister interface (*db.Pool satisfies it) so the minimal path is testable without a live Postgres.
  • cmd/engine + cmd/server set Mode from config and log when minimal mode is active.

How to enable

  • Per process / engine: VLE_INGEST_MODE=minimal
  • Deployed vectorless-server: VLS_INGEST_MODE=minimal (env-only, no secret/config edit)
  • Or in YAML: ingest.mode: minimal (engine config) / engine.ingest.mode: minimal (server config)

Test plan

  • pkg/ingest/minimal_mode_test.go: a minimal-mode run, wired to an LLM client that fails the test on any call, reaches ready with sections persisted and a call counter of 0 (proof that minimal mode is pure Go). It also asserts that no summaries, axes, or HyDE questions were written.
  • Companion test reconstructs the persisted tree → synthesised-TOC fallback is title-bearing and section bodies load back from storage.
  • pkg/retrieval TestPageIndexMinimalIngestedDoc — page-based strategy drives structure→get_pages→done end-to-end against a minimal-ingested doc shape (page ranges + content refs, no summaries, nil TOC) and returns a cited answer from the synthesised TOC + raw page reads.
  • pkg/config — default mode full; VLE_INGEST_MODE=minimal override; Validate accept/reject coverage.
  • go build ./..., go vet ./..., go test ./... all green (both binaries build).
  • ingest.mode documented in config.example.yaml + config.server.example.yaml.

Summary by CodeRabbit

  • New Features

    • Added ingest mode configuration option with "full" and "minimal" modes to control document processing.
    • Minimal mode parses and persists documents while skipping LLM enrichment and table extraction for faster ingestion.
    • Ingest mode configurable via environment variable or configuration file.
  • Tests

    • Added comprehensive test coverage for ingest mode configuration and minimal mode behavior.

Review Change Stack

hallelx2 added 3 commits May 28, 2026 23:40
Add IngestConfig.Mode (yaml `mode`, values full|minimal, default full)
to the engine config, with VLE_INGEST_MODE env override and Validate
rejecting unknown values. Forward it from the deployed server's config
wrapper via firstEnv("VLS_INGEST_MODE", "VLE_INGEST_MODE") so the live
vectorless-server can be flipped to minimal ingest with a single env
var, no secret edit.
…M/tables

Add Pipeline.Mode; when "minimal", Run dispatches to runMinimal which
does parse → build tree → persist → ready and skips every per-section
LLM stage (summarize, HyDE, multi-axis summaries, TOC build). The
parser registry is rebuilt with table extraction DISABLED (nil opts)
regardless of ingest.tables.enabled, since the pdftable table-finding
pass is the slow/hang-prone part of parse and the page-based strategy
reads raw page text (which still contains the table's text).

persistTree/parse/fail now take the persistence target through a narrow
docPersister interface (*db.Pool satisfies it) so the minimal path is
exercisable without a live Postgres. Both cmd/engine and cmd/server set
Mode from cfg.Ingest.Mode and log when minimal mode is active.
- pkg/ingest/minimal_mode_test.go: a minimal-mode pipeline run with an
  LLM client that fails the test on any call reaches StatusReady with
  sections persisted and a call counter of 0 — proving minimal ingest is
  pure-Go. A second test reconstructs the persisted tree and confirms the
  synthesised-TOC fallback is title-bearing and section bodies load back
  from storage.
- pkg/retrieval: TestPageIndexMinimalIngestedDoc drives the page-based
  strategy end-to-end against a minimal-ingested doc shape (page ranges +
  content refs, NO summaries, nil TOC) and asserts it produces a cited
  answer from the synthesised TOC + raw page reads.
- pkg/config: default mode is "full"; VLE_INGEST_MODE=minimal override
  and Validate accept/reject coverage.
- Document ingest.mode in both example configs.
Copilot AI review requested due to automatic review settings May 28, 2026 23:08
@sourcery-ai sourcery-ai Bot left a comment

Sorry @hallelx2, you have reached your weekly rate limit of 500000 diff characters.

Please try again later or upgrade to continue using Sourcery

coderabbitai Bot commented May 28, 2026

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

This PR adds ingest mode configuration that allows documents to be marked ready after parsing and persisting, skipping expensive LLM enrichment stages (summarize, HyDE, multi-axis) and table/TOC extraction. The feature wires mode through config validation, pipeline branching, startup initialization, and includes comprehensive tests ensuring minimal-mode documents remain queryable.

Changes

Ingest Mode Feature

| Layer | File(s) | Summary |
|---|---|---|
| Configuration contract and validation | pkg/config/config.go, pkg/config/config_test.go, internal/config/config.go, config.example.yaml, config.server.example.yaml | IngestConfig gains a Mode field (full/minimal), defaulting to "full". Validate() enforces the enum, and applyEnvOverrides supports VLE_INGEST_MODE and VLS_INGEST_MODE environment variables. Configuration examples document the two modes and clarify that global_llm_concurrency is ignored in minimal mode. |
| Pipeline mode branching and persistence abstraction | pkg/ingest/ingest.go | Pipeline.Mode field branches Run() to either runMinimal() or the full enrichment path. New docPersister interface abstracts persistence operations so minimal mode skips the concrete DB pool. Refactored parse(), persistTree(), and fail() to accept parsers/store parameters, enabling both modes to share the same logical flow while executing different enrichment stages. |
| Startup wiring and logging | cmd/engine/main.go, cmd/server/main.go | Pipeline initialization receives cfg.Ingest.Mode (or cfg.Engine.Ingest.Mode for server). Startup branches on minimal mode to log stage-skipping behavior and bypass table-extraction configuration. |
| Minimal mode test coverage | pkg/ingest/minimal_mode_test.go | In-memory test doubles (fakeDocStore, failIfCalledLLM) and two tests verify minimal mode reaches ready, persists queryable sections without LLM calls, and that persisted sections reconstruct a valid tree with loadable content. |
| Retrieval strategy integration | pkg/retrieval/pageindex_strategy_test.go | PageIndex retrieval strategy test validates TOC synthesis and content loading when operating on minimal-ingested documents. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • hallelx2/vectorless-engine#24: Both PRs modify ingest pipeline behavior gating; this PR adds ingest.mode=minimal to skip enrichment stages, while that PR adds an LLM TOC stage conditional on ingest config.

Poem

🐰 A rabbit hops through modes so swift,
Parse and persist—minimal lift!
Full enrichment flows like carrot dreams,
While minimal skips the LLM schemes,
Ready or rich, the choice is clear! 🥕✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 61.11% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: adding a minimal ingest mode that parses, persists, and marks documents ready while skipping LLM enrichment and table extraction. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Copilot AI left a comment

Pull request overview

Adds a new minimal ingest mode that collapses the pipeline to parse → persist → ready, skipping all LLM enrichment stages (summarize, HyDE, multi-axis, TOC build) and the pdftable table-finding pass. This makes documents queryable in seconds via the page-based retrieval strategy, which doesn't need any of the skipped enrichment.

Changes:

  • New Pipeline.Mode field with ModeMinimal constant; Run dispatches to a new runMinimal that parses with RegistryFromTableOpts(nil), persists the section tree, and flips straight to StatusReady.
  • Introduces a docPersister interface over the persistence calls so the minimal path is testable without Postgres; parse, persistTree, and fail now take it as a parameter.
  • Wires ingest.mode config (with VLE_INGEST_MODE and forwarded VLS_INGEST_MODE overrides), validation, defaults, example YAMLs, and startup logging.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| pkg/ingest/ingest.go | Adds ModeMinimal, docPersister, runMinimal; threads persister through parse/persistTree/fail. |
| pkg/ingest/minimal_mode_test.go | New tests asserting zero LLM calls, ready status, and queryable post-ingest shape. |
| pkg/retrieval/pageindex_strategy_test.go | New cross-package test proving page-based strategy answers a minimal-ingested doc (nil TOC, no summaries). |
| pkg/config/config.go | Adds IngestConfig.Mode with default full, env override, and validation. |
| pkg/config/config_test.go | Covers default, env override, and validate accept/reject for ingest.mode. |
| internal/config/config.go | Forwards VLS_INGEST_MODE/VLE_INGEST_MODE to Engine.Ingest.Mode. |
| cmd/engine/main.go | Sets Pipeline.Mode from config; logs when minimal mode is active. |
| cmd/server/main.go | Same wiring + logging on the server binary. |
| config.example.yaml | Documents the new ingest.mode option. |
| config.server.example.yaml | Documents engine.ingest.mode plus the VLS_INGEST_MODE override. |

@hallelx2 hallelx2 merged commit dfc1c45 into main May 28, 2026
4 of 9 checks passed
@hallelx2 hallelx2 deleted the feat/minimal-ingest-mode branch May 28, 2026 23:11